PLSC30500, Fall 2024

Part 2. Summarizing distributions (part b)

Andy Eggers

Summarizing joint distributions

Motivation

Suppose we have two RVs \(X\) and \(Y\)

  • number of heads in one coin flip and number of green balls drawn from urn in 6 tries
  • age and height of randomly selected student
  • whether randomly selected citizen served in military and supports a foreign war

We know the joint PMF/PDF \(f(x, y)\) and joint CDF \(F(x, y)\).

How can we summarize the relationship between \(X\) and \(Y\)?

Covariance

\[\text{Cov}[X, Y] = {\textrm E}\left[ (X - {\textrm E}[X])(Y - {\textrm E}[Y]) \right]\]

Intuitively, “Does \(X\) tend to be above \({\textrm E}[X]\) when \(Y\) is above \({\textrm E}[Y]\)? (And by how much?)”

\[ f(x,y) = \begin{cases} 1/3 & x = 0, y = 0 \\ 1/6 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]

What is \({\textrm E}[X]\)? What is \({\textrm E}[Y]\)?

Then compute expectation of \((X - {\textrm E}[X])(Y - {\textrm E}[Y])\) (function of two RVs) as above.
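As a sanity check, these quantities can be computed directly from the example PMF above. A minimal sketch in plain Python (the dict representation and names are just illustrative, not course code):

```python
# Joint PMF from the slide, as a dict mapping (x, y) -> f(x, y)
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

E_X = sum(x * p for (x, y), p in pmf.items())  # E[X] = 2/3
E_Y = sum(y * p for (x, y), p in pmf.items())  # E[Y] = 1/2

# Covariance as E[(X - E[X])(Y - E[Y])]: an expectation of a function of (X, Y)
cov = sum((x - E_X) * (y - E_Y) * p for (x, y), p in pmf.items())
print(E_X, E_Y, cov)  # Cov[X, Y] = 1/6
```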

Variance and covariance

Compare:

\[\begin{align}\text{Cov}[X, Y] &= {\textrm E}\left[ \color{blue}{(X - {\textrm E}[X])}\color{orange}{(Y - {\textrm E}[Y])} \right] \\ {\textrm V}[X] &= {\textrm E}\left[ \color{blue}{(X - {\textrm E}[X])}\color{blue}{(X - {\textrm E}[X])} \right]\end{align}\]

  • Variance of \(X\) is covariance between \(X\) and itself.
  • Variance can’t be negative but covariance can
  • One justification for the square (\(^2\)) in the variance formula

Geometric representation (1)

Plot the points in \(\text{Supp}[X, Y]\) on two axes with point size proportional to \(f(x, y)\).

Divide the \(x, y\) plane into quadrants defined by \(x = {\textrm E}[X]\) and \(y = {\textrm E}[Y]\).

Geometric representation (2)

For each point \((x, y) \in \text{Supp}[X, Y]\), create a rectangle with \((x,y)\) at one corner and \(({\textrm E}[X], {\textrm E}[Y])\) at the opposite corner.

Shade the rectangle green in quadrants I and III (where \((x - {\textrm E}[X])(y - {\textrm E}[Y]) > 0\)), otherwise red, with intensity proportional to \(f(x,y)\).

Covariance (roughly) measures how much green vs red there is.

Geometric representation (3)

Geometric representation (4)

Geometric representation (5)

Alternative formulation

First formulation:

\[\text{Cov}[X, Y] = {\textrm E}\left[ (X - {\textrm E}[X])(Y - {\textrm E}[Y]) \right]\]

As with variance, an alternative formulation:

\[\text{Cov}[X, Y] = {\textrm E}\left[XY\right] - {\textrm E}[X]{\textrm E}[Y]\]

Note:

  • if \({\textrm E}[X] = {\textrm E}[Y] = 0\) (e.g. if recentered), both are \({\textrm E}[XY]\)
  • geometrically, can think in terms of areas of rectangles
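Both formulations can be checked numerically on the example PMF from the covariance slide (a sketch; variable names are illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

E_X = sum(x * p for (x, y), p in pmf.items())
E_Y = sum(y * p for (x, y), p in pmf.items())
E_XY = sum(x * y * p for (x, y), p in pmf.items())

# First formulation: expectation of the product of deviations
cov_def = sum((x - E_X) * (y - E_Y) * p for (x, y), p in pmf.items())
# Alternative formulation: E[XY] - E[X]E[Y]
cov_alt = E_XY - E_X * E_Y

assert abs(cov_def - cov_alt) < 1e-12  # both give 1/6
```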

Geometry of \({\textrm E}[XY] - {\textrm E}[X]{\textrm E}[Y]\)

Two example distributions: one with \((1,1)\) and \((3,3)\) equally likely, one with \((1,3)\) and \((3,1)\) equally likely.

Geometry of \({\textrm E}[X]{\textrm E}[Y]\)

\({\textrm E}[X]{\textrm E}[Y]\) is the area below and to the left of the dashed lines

Geometry of \({\textrm E}[XY]\)

\({\textrm E}[XY]\) is the average of the two areas (given equal probability)

Linearity of expectations, but not variances

If \(g\) is a linear map (a linear function or linear operator), then \(g(x + y) = g(x) + g(y)\). (The additivity property.) Examples?


Recall linearity of expectations: \({\textrm E}[X + Y] = {\textrm E}[X] + {\textrm E}[Y]\).


But in general \(\text{Var}[X + Y] \neq \text{Var}[X] + \text{Var}[Y]\)

Why not?

Variance rule (non-linearity of variance)

A different proof from A&R 2.2.3

Writing \(w = X - {\textrm E}[X]\) and \(z = Y - {\textrm E}[Y]\):

\[\begin{aligned} \text{Var}(X+Y) &= {\textrm E}[(X + Y - {\textrm E}[X + Y])^2] \\ &= {\textrm E}[(X - {\textrm E}[X] + Y - {\textrm E}[Y])^2] \\ &= {\textrm E}[(w + z)^2] \\ &= {\textrm E}[w^2 + z^2 + 2 w z] \\ &= {\textrm E}[w^2] + {\textrm E}[z^2] + {\textrm E}[2 w z] \\ &= {\textrm E}[(X - {\textrm E}[X])^2] + {\textrm E}[(Y - {\textrm E}[Y])^2] + 2{\textrm E}[(X - {\textrm E}[X])(Y - {\textrm E}[Y])] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \end{aligned}\]
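The variance rule can be verified on the example joint PMF used earlier (a sketch; the helper `E` for "expectation of a function of \((X, Y)\)" is illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def E(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)
var_X = E(lambda x, y: (x - E_X) ** 2)
var_Y = E(lambda x, y: (y - E_Y) ** 2)
cov = E(lambda x, y: (x - E_X) * (y - E_Y))

# Variance of the sum, computed directly ...
E_S = E(lambda x, y: x + y)
var_sum = E(lambda x, y: (x + y - E_S) ** 2)

# ... matches Var[X] + Var[Y] + 2 Cov[X, Y]
assert abs(var_sum - (var_X + var_Y + 2 * cov)) < 1e-12
```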

Correlation

The correlation of two RVs \(X\) and \(Y\) with \(\sigma[X] > 0\) and \(\sigma[Y] > 0\) is

\[ \rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]}\]

Correlation is scale-invariant: \(\rho[X, Y] = \rho[aX, bY]\) for \(a, b > 0\)

Prove it!

Proof of scale-invariance of correlation

\[\begin{align} \text{Cov}[aX, bY] &= {\textrm E}[aX bY] - {\textrm E}[aX]{\textrm E}[bY] \\ &= ab {\textrm E}[XY] - ab {\textrm E}[X]{\textrm E}[Y] \\ &= ab ({\textrm E}[XY] - {\textrm E}[X]{\textrm E}[Y]) \\ &= ab \text{Cov}[X, Y] \end{align}\]

\[\sigma[aX] = \sqrt{\text{V}[aX]} = \sqrt{a^2 \text{V}[X]} = a \sigma[X] \quad \text{(since } a > 0\text{)}\]

By the same argument, \(\sigma[bY] = b\sigma[Y]\).

So

\[\begin{align} \rho[aX, bY] &= \frac{\text{Cov}[aX, bY]}{\sigma[aX] \sigma[bY]} \\ &= \frac{ab \text{Cov}[X, Y]}{a \sigma[X] b \sigma[Y]} = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]} \\ &= \rho[X, Y] \end{align}\]
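A quick numerical check of scale invariance, reusing the example joint PMF from the covariance slide (a sketch; the `rho` helper is illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def rho(a, b):
    """Correlation of (aX, bY) under the joint PMF above."""
    pts = [((a * x, b * y), p) for (x, y), p in pmf.items()]
    EX = sum(x * p for (x, y), p in pts)
    EY = sum(y * p for (x, y), p in pts)
    cov = sum((x - EX) * (y - EY) * p for (x, y), p in pts)
    sdX = sum((x - EX) ** 2 * p for (x, y), p in pts) ** 0.5
    sdY = sum((y - EY) ** 2 * p for (x, y), p in pts) ** 0.5
    return cov / (sdX * sdY)

# Rescaling by any a, b > 0 leaves the correlation unchanged
assert abs(rho(1, 1) - rho(7, 0.3)) < 1e-12
```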

Conditional expectations

We spent time on expectations:

\[{\textrm E}[Y] = \sum_y y f(y).\]

Also on conditional distributions:

\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}\]

Combining the two ideas, we get conditional expectations:

\[{\textrm E}[Y \mid X = x] = \sum_y y f_{Y|X}(y \mid x).\]

i.e. the expectation of \(Y\) at some \(x\).
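On the example joint PMF from the covariance slide, the conditional expectation at each \(x\) is easy to compute directly (a sketch; `cond_exp_Y` is an illustrative name):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def cond_exp_Y(x0):
    """E[Y | X = x0]: weight y by f(x0, y) / f_X(x0)."""
    fx = sum(p for (x, y), p in pmf.items() if x == x0)  # marginal f_X(x0)
    return sum(y * p for (x, y), p in pmf.items() if x == x0) / fx

print(cond_exp_Y(0), cond_exp_Y(1))  # E[Y|X=0] = 0, E[Y|X=1] = 3/4
```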

Illustration

Another illustration

(Red line is \({\textrm E}[Y | X = x]\), dots are a sample from \(f(x, y)\))

Illustration (2)

(Red line is \({\textrm E}[Y | X = x]\), dots are a sample from \(f(x, y)\))

Conditional variance

Two formulations:

\[{\textrm V}[Y | X = x] = {\textrm E}[(Y - {\textrm E}[Y | X =x])^2 | X = x]\] \[{\textrm V}[Y | X = x] = {\textrm E}[Y^2 | X = x] - {\textrm E}[Y | X =x]^2\]

Conditional variance (2)

Two formulations:

\[{\textrm V}[Y | X = x] = {\textrm E}[(Y - {\textrm E}[Y | X =x])^2 | X = x]\]

\[{\textrm V}[Y | X = x] = {\textrm E}[Y^2 | X = x] - {\textrm E}[Y | X =x]^2\]
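Both formulations agree on the example joint PMF from the covariance slide; a sketch (illustrative helper names):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def cond_moment(x0, k):
    """E[Y^k | X = x0] from the joint PMF."""
    fx = sum(p for (x, y), p in pmf.items() if x == x0)
    return sum(y ** k * p for (x, y), p in pmf.items() if x == x0) / fx

x0 = 1
m1 = cond_moment(x0, 1)  # E[Y | X = 1] = 3/4
# First formulation: E[(Y - E[Y|X=x])^2 | X = x]
fx = sum(p for (x, y), p in pmf.items() if x == x0)
v1 = sum((y - m1) ** 2 * p for (x, y), p in pmf.items() if x == x0) / fx
# Second formulation: E[Y^2 | X = x] - E[Y | X = x]^2
v2 = cond_moment(x0, 2) - m1 ** 2

assert abs(v1 - v2) < 1e-12  # both equal 3/16
```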

Conditional expectations vs Conditional expectation function (CEF)

Conditional expectation \({\textrm E}[Y | X = x]\) is for a specific \(x\).

Conditional expectation function (CEF) \({\textrm E}[Y | X]\) is for all \(x\).

CEF as best predictor

The CEF \({\textrm E}[Y | X]\) is the expectation of \(Y\) at each \(X\).

We already established that the expectation/mean is the best (in MSE sense) predictor.

So CEF is the best possible way to use \(X\) to predict \(Y\). (See Theorem 2.2.20.)

Multivariate generalization: \({\textrm E}[Y \mid X_1, X_2, X_3, \ldots, X_n]\) is the best way to use \(X_1, \ldots, X_n\) to predict \(Y\).

Law of iterated expectations

For random variables \(X\) and \(Y\),

\[{\textrm E}[Y] = {\textrm E}[{\textrm E}[Y | X]]\]

This means there are two ways to get \({\textrm E}[Y]\):

  • start with \(f(y)\), take expectations: \({\textrm E}[Y] = \sum_y y f(y)\)
  • start with \({\textrm E}[Y \mid X]\) and \(f_X(x)\), take expectations: \({\textrm E}[Y] = \sum_x {\textrm E}[Y \mid X=x] f_X(x)\)

In words: An unconditional average (\({\textrm E}[Y]\)) can be represented as a weighted average of conditional expectations (\({\textrm E}[Y \mid X]\)) with weights taken from the distribution of the variable conditioned on, i.e. \(X\).
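Both routes can be traced on the example joint PMF from the covariance slide (a sketch; variable names are illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

# Route 1: marginalize to f(y), then take the expectation
f_Y = {}
for (x, y), p in pmf.items():
    f_Y[y] = f_Y.get(y, 0) + p
E_Y_direct = sum(y * p for y, p in f_Y.items())

# Route 2: weight each E[Y | X = x] by the marginal f_X(x)
f_X = {}
for (x, y), p in pmf.items():
    f_X[x] = f_X.get(x, 0) + p
E_Y_iterated = 0.0
for x0, px in f_X.items():
    cond = sum(y * p for (x, y), p in pmf.items() if x == x0) / px
    E_Y_iterated += cond * px

assert abs(E_Y_direct - E_Y_iterated) < 1e-12  # both give E[Y] = 1/2
```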

Why would you want to do that?

LIE: An intuitive example

A population is 80% female and 20% male.

The average age among females (\({\textrm E}[Y | X = 1]\)) is 25. The average age among males (\({\textrm E}[Y | X = 0]\)) is 20.

What is the average age in the population, \({\textrm E}[Y]\)?

\[{\textrm E}[{\textrm E}[Y | X]] = .8 \times 25 + .2 \times 20 = 24\]

See homework for another example.

LIE: another example

LIE: another example (2)

How LIE is used in causal inference (preview)

Suppose we want to measure the average effect of participating in a program (e.g. job training, voter education, military mobilization).

Call \(Y\) the (unobservable) effect of the treatment. We want the average treatment effect (ATE), \({\textrm E}[Y]\).

Suppose that comparing participants and non-participants gives us a good estimate of the average treatment effect only within subgroups defined by age (\(X\)).

So we have \({\textrm E}[Y \mid X]\).

Now we just combine these estimates (by LIE): \({\textrm E}[Y] = {\textrm E}[{\textrm E}[Y \mid X]] = \sum_{x} {\textrm E}[Y \mid X = x] f(x)\)

Law of total variance

\[{\textrm V}[Y] = {\textrm E}[{\textrm V}[Y|X]] + {\textrm V}[{\textrm E}[Y|X]]\]

In words, the variance of \(Y\) can be decomposed into the expected conditional variance (\({\textrm E}[{\textrm V}[Y|X]]\)) and the variance of the conditional expectation (\({\textrm V}[{\textrm E}[Y|X]]\)).

Sometimes called “Ev(v)e’s law” because

\[{\textrm V}[Y] = \color{red}{{\textrm E}}[\color{red}{{\textrm V}}[Y|X]] + \color{red}{{\textrm V}}[\color{red}{{\textrm E}}[Y|X]]\]
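Eve's law can be verified on the example joint PMF from the covariance slide (a sketch; helper names are illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

f_X = {}
for (x, y), p in pmf.items():
    f_X[x] = f_X.get(x, 0) + p

def cond_E(x0, k=1):
    """E[Y^k | X = x0] from the joint PMF."""
    return sum(y ** k * p for (x, y), p in pmf.items() if x == x0) / f_X[x0]

# E[V[Y|X]]: average the conditional variances over f_X
E_cond_var = sum((cond_E(x0, 2) - cond_E(x0) ** 2) * px for x0, px in f_X.items())
# V[E[Y|X]]: variance of the conditional means over f_X
mean_cond = sum(cond_E(x0) * px for x0, px in f_X.items())
var_cond_mean = sum((cond_E(x0) - mean_cond) ** 2 * px for x0, px in f_X.items())

E_Y = sum(y * p for (x, y), p in pmf.items())
var_Y = sum((y - E_Y) ** 2 * p for (x, y), p in pmf.items())
assert abs(var_Y - (E_cond_var + var_cond_mean)) < 1e-12  # 1/4 = 1/8 + 1/8
```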

Law of total variance (2)

Best linear predictor (BLP)

Suppose we want to predict \(Y\) using \(X\), and we focus on a linear predictor, i.e. a function of the form \(\alpha + \beta X\).

The best (minimum MSE) predictor satisfies

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}\,[\left(Y - (a + bX)\right)^2]\]

The solution (see Theorem 2.2.21) is

  • \(\beta = \frac{\textrm{Cov}[X, Y]}{\textrm{V}[X]}\)
  • \(\alpha = {\textrm E}[Y] - \beta {\textrm E}[X]\)

So we could obtain the BLP from a joint PMF. (See homework.)
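For instance, on the example joint PMF from the covariance slide, the BLP coefficients come out of the formulas above; a sketch with a brute-force check that nearby coefficients do no better (illustrative names):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

E_X = sum(x * p for (x, y), p in pmf.items())
E_Y = sum(y * p for (x, y), p in pmf.items())
var_X = sum((x - E_X) ** 2 * p for (x, y), p in pmf.items())
cov = sum((x - E_X) * (y - E_Y) * p for (x, y), p in pmf.items())

beta = cov / var_X        # Cov[X, Y] / V[X] = 3/4 here
alpha = E_Y - beta * E_X  # = 0 here

# Sanity check: no nearby (a, b) achieves lower MSE than (alpha, beta)
def mse(a, b):
    return sum((y - (a + b * x)) ** 2 * p for (x, y), p in pmf.items())

for da in (-0.01, 0.01):
    for db in (-0.01, 0.01):
        assert mse(alpha, beta) <= mse(alpha + da, beta + db)
```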

BLP predicts CEF

Above, we were looking for best linear predictor (BLP) of \(Y\) as function of \(X\):

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}[\left(Y - (a + bX)\right)^2]\]

Same answer if you look for the best linear predictor of the CEF \({\textrm E}[Y | X]\):

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}[\left({\textrm E}[Y|X] - (a + bX)\right)^2]\]
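On the example joint PMF from the covariance slide, the two targets give the same slope (a sketch; `cef` and the `beta_*` names are illustrative):

```python
# Joint PMF from the covariance example
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

f_X = {}
for (x, y), p in pmf.items():
    f_X[x] = f_X.get(x, 0) + p
# CEF: E[Y | X = x] at each point of support
cef = {x0: sum(y * p for (x, y), p in pmf.items() if x == x0) / f_X[x0]
       for x0 in f_X}

E_X = sum(x * px for x, px in f_X.items())
var_X = sum((x - E_X) ** 2 * px for x, px in f_X.items())
E_Y = sum(y * p for (x, y), p in pmf.items())

# Slope for target Y: Cov[X, Y] / V[X]
beta_Y = sum((x - E_X) * (y - E_Y) * p for (x, y), p in pmf.items()) / var_X
# Slope for target E[Y|X]: Cov[X, E[Y|X]] / V[X]
E_m = sum(cef[x0] * px for x0, px in f_X.items())
beta_m = sum((x0 - E_X) * (cef[x0] - E_m) * px for x0, px in f_X.items()) / var_X

assert abs(beta_Y - beta_m) < 1e-12  # same BLP either way
```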